**CUNEF**
Practice III - Retail score analysis
Machine Learning
Pablo Mazariegos Reviriego - pablo.mazariegos@cunef.edu
Mario Sabater Pascual - mario.sabater@cunef.edu

In this Machine Learning practice we will be working with the Yelp dataset. The practice is composed of the following notebooks:
This notebook goes from reading the files in JSON format to presenting a problem that is both interesting and feasible to approach with the available data through machine learning. The index is as follows:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.gridspec as gridspec
from pandas.plotting import autocorrelation_plot
import seaborn as sns
import plotly.express as px
import scipy.stats as ss
import math
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
import warnings
warnings.filterwarnings('ignore')
def json_to_dataframes(json_file, name, parts):
    # Open the JSON file and read it line by line into a list
    with open(json_file, "r", errors='ignore') as f:
        data = f.readlines()
    # List to store the dictionaries
    dict_list = []
    # Iterate over the lines of the file and load each line as a dictionary
    for line in data:
        dict_list.append(json.loads(line))
    # Create a dataframe from the list of dictionaries
    df = pd.DataFrame.from_dict(dict_list)
    # Compute the number of rows of the dataframe
    num_rows = df.shape[0]
    # Split the dataframe into `parts` equal slices and store them in a list
    df_list = [df[i:i+num_rows//parts] for i in range(0, num_rows, num_rows//parts)]
    # Build a list with the names of the dataframes
    df_names = [f"{name}{i+1}" for i in range(parts)]
    # Assign the names to the slices and store them in a dictionary
    df_dict = {df_names[i]: df_list[i] for i in range(parts)}
    # Return the dictionary of dataframes
    return df_dict
To make reading the JSON file easier, and in order not to store a large volume of data in memory, we use the function json_to_dataframes. This function divides the JSON file into a number of equal parts (set by the "parts" argument of the function) and lets us load only the ones we are interested in. This way we can work with a percentage of the data and visualize it to get an idea of the information contained in the selected JSON file.
The function also lets us retrieve each part via df_dict["name1"], where name is the second argument we pass to json_to_dataframes.
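As a side note, pandas can also read JSON-lines files in bounded chunks natively via `read_json` with `lines=True` and `chunksize`, which achieves a similar memory saving without a custom function. A minimal self-contained sketch (the tiny sample file and its contents are made up for illustration):

```python
import json
import pandas as pd

# Write a tiny JSON-lines file so the sketch is self-contained
sample = [{"business_id": str(i), "stars": 3.5} for i in range(10)]
with open("sample.json", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read the file lazily, 4 rows at a time; each chunk is a DataFrame
reader = pd.read_json("sample.json", lines=True, chunksize=4)
chunks = [chunk for chunk in reader]
print([len(c) for c in chunks])  # [4, 4, 2]
```

Unlike json_to_dataframes, this never materializes the full dataframe, so peak memory stays at one chunk.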
The data
As described on the Yelp website, the dataset is "a subset of our businesses, reviews, and user data for use in personal, educational, and academic purposes. Available as JSON files, use it to teach students about databases, to learn NLP, or for sample production data while you learn how to make mobile apps".
The original dataset contains information on:
For the purpose of this project we will discard the 200,100 pictures. The rest of the information is stored in 5 JSON files, all named "yelp_academic_dataset_" followed by:
The dataset and its terms and conditions file are available to download at:
As we can deduce from what we just read about the data, the file sizes make it completely inefficient to load the full files into memory, or even to use them as .csv files. Hence the function defined above, json_to_dataframes, which lets us work with only a fraction of each file.
We use the json_to_dataframes function, naming the resulting dataframes business1 to business10, since we decided to divide the file into 10 parts.
df_dict = json_to_dataframes("../data/raw/yelp_academic_dataset_business.json", "business",10)
business1 = df_dict["business1"]
business1.head()
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | None |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
Here, by observing the first 5 rows of business we can get an idea of what information the file contains. As can also be deduced from the name, the file holds information about the different businesses recorded in the database. We have the basics, these being the name, address, city, state, whether it is open or not and the opening hours, as well as others that can be more interesting for our purposes:
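Note that the attributes column holds a nested dict per row (with boolean values stored as the strings 'True'/'False', as the rows above show). As a minimal sketch, not part of the original notebook, pd.json_normalize can flatten it into one column per attribute; the two-row stand-in dataframe mimics the structure shown above:

```python
import pandas as pd

# Tiny stand-in for business1, mimicking the rows shown above
df = pd.DataFrame({
    "name": ["Abby Rappoport, LAC, CMQ", "The UPS Store"],
    "attributes": [{"ByAppointmentOnly": "True"},
                   {"BusinessAcceptsCreditCards": "True"}],
})

# Flatten the nested dicts; attributes missing in a row become NaN
flat = pd.json_normalize(df["attributes"].tolist())
print(list(flat.columns))  # ['ByAppointmentOnly', 'BusinessAcceptsCreditCards']
```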
Ratings distribution
# Create a figure with two subplots
fig, axs = plt.subplots(2, 1, figsize=(15, 15), sharex=False)
# Plot the first graph having it done with business1 (10% of the data)
# Get the distribution of the ratings
x = business1['stars'].value_counts()
x = x.sort_index()
# Plot the bar chart
axs[0] = sns.barplot(x=x.index, y=x.values, alpha=0.8, ax=axs[0])
axs[0].set_title("Star Rating Distribution 10% of the business")
axs[0].set_ylabel('# of businesses', fontsize=12)
axs[0].set_xlabel('Star Ratings ', fontsize=12)
# Adding the text labels
rects = axs[0].patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    axs[0].text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
# Open the JSON file in read mode
with open('../data/raw/yelp_academic_dataset_business.json', 'r') as f:
    # Initialize the ratings list
    ratings = []
    # Iterate over the lines in the file
    for line in f:
        # Load the line as a JSON object
        data = json.loads(line)
        # Append the rating to the list
        ratings.append(data['stars'])
# Get the distribution of the ratings
x1 = pd.Series(ratings).value_counts()
x1 = x1.sort_index()
# Plot the bar chart
axs[1] = sns.barplot(x=x1.index, y=x1.values, alpha=0.8, width=0.8, ax=axs[1])
axs[1].set_title("Star Rating Distribution 100% of the business")
axs[1].set_ylabel('# of businesses', fontsize=12)
axs[1].set_xlabel('Star Ratings ', fontsize=12)
# Adding the text labels
rects = axs[1].patches
labels = x1.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    axs[1].text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
Number of reviews distribution
Let's visualize the distribution of the number of reviews a business has. We apply a log transform, since the distribution is extremely skewed.
# Create a figure with two subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
# Plot the first graph having it done with business1 (10% of the data)
sns.histplot(business1['review_count'].apply(np.log1p), kde=True, ax=axs[0])
axs[0].set_title("Distributions number of reviews - 10% of business (business1)")
#Second graph with the whole json:
# Read the json file
with open("../data/raw/yelp_academic_dataset_business.json", "r") as f:
    businesses = f.readlines()
# Remove the trailing whitespaces and parse the json objects
businesses = [json.loads(business.strip()) for business in businesses]
# Extract the review counts from the businesses
review_counts = [business["review_count"] for business in businesses]
# Plot the distribution plot
sns.histplot(review_counts, kde=True, ax=axs[1])
axs[1].set_title("Distributions number of reviews 100% of business")
# Adjust the spacing between the two subplots
fig.tight_layout()
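To illustrate why the log transform helps, a small sketch with synthetic review counts (the values here are made up, not taken from the dataset):

```python
import numpy as np

# Heavily right-skewed synthetic counts: mostly small, one huge
counts = np.array([1, 2, 3, 5, 8, 20, 150, 7000])

logged = np.log1p(counts)  # log(1 + x), safe at x = 0

# The raw values span almost four orders of magnitude;
# the logged values fit in a range of about 8 units
print(counts.max() / counts.min())                    # 7000.0
print(round(float(logged.max() - logged.min()), 2))   # 8.16
```

On the logged scale the bulk of the distribution is no longer squeezed against the left edge of the plot.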
As we stated before, the Yelp dataset contains information on 150,346 businesses. In order to get an idea of what the data looked like without using too much memory, we used the function json_to_dataframes, obtaining a smaller dataframe.
Looking at the four previous graphs (star rating distribution and number-of-reviews distribution), we can conclude that the business1 dataframe (the first 10% of the business JSON) is a representative sample of the whole. The main difference lies in the number-of-reviews distribution, where the full data contains businesses with more than 7k reviews, but we can treat those almost as outliers, since the main density is on the extreme left of the distribution.
Whenever possible we will use the whole dataset; otherwise, the selected percentage of the data (in this case 10% of business) will be used.
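The visual check of representativeness could also be backed by a statistic, for example a two-sample Kolmogorov–Smirnov test from scipy comparing the sample against the full column. A minimal sketch on synthetic ratings (a stand-in for the real stars column, which is not loaded here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for the full stars column (half-star grid, 1.0-5.0)
full = rng.choice(np.arange(1.0, 5.5, 0.5), size=10_000)
sample = full[: len(full) // 10]  # first 10%, mimicking business1

# Two-sample KS test: a small statistic (large p-value) means the
# empirical distributions of sample and full data are close
statistic, p_value = stats.ks_2samp(sample, full)
print(statistic, p_value)
```

With real data, the same call would take `business1['stars']` and the full ratings list as arguments.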
Business Map visualization
One of the main pieces of information provided by the business dataset is the location of the businesses, with variables such as address, city, state, zipcode, longitude and latitude. The most accurate, and the easiest to use for a map visualization, are the last two.
# Open the JSON file in read mode
with open('../data/raw/yelp_academic_dataset_business.json', 'r', errors='ignore') as f:
    # Create an empty marker cluster
    mc = MarkerCluster()
    # Iterate over the lines in the file
    for line in f:
        # Load the line as a JSON object
        data = json.loads(line)
        # Check that the latitude and longitude are not null
        if not math.isnan(data['longitude']) and not math.isnan(data['latitude']):
            # Add a marker to the cluster
            mc.add_child(Marker([data['latitude'], data['longitude']]))
Dotmap (whole dataset)
# Create a Folium map
dotmap = folium.Map(location=[38.889722, -77.008889], tiles='cartodbpositron', zoom_start=3)
# Add the marker cluster to the map
dotmap.add_child(mc)
# Display the map
dotmap